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The PAC-Bayesian approach is a powerful set of techniques to derive non- 
asymptotic risk bounds for random estimators. The corresponding optimal 
distribution of estimators, usually called the Gibbs posterior, is unfortunately 
intractable. One may sample from it using Markov chain Monte Carlo, but 
this is often too slow for big datasets. We consider instead variational approx¬ 
imations of the Gibbs posterior, which are fast to compute. We undertake 
a general study of the properties of such approximations. Our main finding 
is that such a variational approximation has often the same rate of conver¬ 
gence as the original PAC-Bayesian procedure it approximates. We specialise 
our results to several learning tasks (classification, ranking, matrix comple¬ 
tion), discuss how to implement a variational approximation in each case, 
and illustrate the good properties of said approximation on real datasets. 


1. Introduction 


A Gibbs posterior, also known as a PAC-Bayesian or pseudo-posterior, is a probability 
distribution for random estimators of the form: 


P\(M) 


exp[—Ar n (fl)] 

r r \ i i 7r(df'). 
J exp[-Ar n Jd7r 


More precise definitions will follow, but for now, 9 may be interpreted as a parameter 
(in a finite or infinite-dimensional space), r n {9) as an empirical measure of risk (e.g. 
prediction error), and 7r(d@) a prior distribution. 

We will follow in this paper the PAC (Probably Approximatively Correct)-Bayesian 


McAllester, 1998, Catoni, 

2004 ; see 

Caton 

2007 for an exhaustive study, and Jiang and 

Tanner (20081, Yang 2004 

, Zhang 

2006;, 

Dalalyan and Tsybakov 2008 for related per- 


spectives (such as the aggregation of estimators in the last 3 papers). There, p\ appears 
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as the probability distribution that minimises the upper bound of an oracle inequality 
on the risk of random estimators. The PAC-Bayesian approach offers sharp theoretical 
guarantees on the properties of such estimators, without assuming a particular model 
for the data generating process. 

The Gibbs posterior has also appeared in other places, and under different motiva¬ 
tions: in Econometrics, as a way to avoid direct maximisation in moment estimation 
|Chernozhukov and Hong 2003 ; and in Bayesian decision theory, as as way to define 


a Bayesian posterior distribution when no likelihood has been specified [Bissiri et al. 


2013 . Another well-known connection, although less directly useful (for Statistics), is 
with thermodynamics, where r n is interpreted as an energy function, and A as the inverse 
of a temperature. 

Whatever the perspective, estimators derived from Gibbs posteriors usually show ex¬ 
cellent performance in diverse tasks, such as classification, regression, ranking, and so 
on, yet their actual implementation is still far from routine. The usual recommendation 
[Dalalyan and Tsybakov 2012, Alquier and Biau, 2013, Guedj and Alquier 2013] is to 


sample from a Gibbs posterior using MCMC [Markov chain Monte Carlo, see e.g. Green 


et al. 2015 ; but constructing an efficient MCMC sampler is often difficult, and even 


efficient implementations are often too slow for practical uses when the dataset is very 
large. 

In this paper, we consider instead VB (Variational Bayes) approximations, which have 
been initially developed to provide fast approximations of ‘true’ posterior distributions 
(i.e. Bayesian posterior distributions for a given model); see Jordan et al. 119991, MacKay 
[2002 


and Chap. 10 in Bishop 12006 


Our main results are as follows: when PAC-Bayes bounds are available - mainly, when 
a strong concentration inequality holds - replacing the Gibbs posterior by a variational 
approximation does not affect the rate of convergence to the best possible prediction, 
on the condition that the Kiillback-Leibler divergence between the posterior and the 
approximation is itself controlled in an appropriate way. 

We also provide empirical bounds, which may be computed from the data so as to 
ascertain the actual performance of estimators obtained by variational approximation. 
All the results gives strong incentives, we believe, to recommend Variational Bayes as 
the default approach to approximate Gibbs posteriors. 

The rest of the paper is organized as follows. In Section [2] we introduce the notations 
and assumptions. In Section [3] we introduce variational approximations and the corre¬ 
sponding algorithms. The main results are provided in general form in Section [4j in 
Subsection |4.1] we give results under the assumption that a Hoeffding type inequality 


holds (slow rates) and in Subsection 4.2 we give results under the assumption that a 
Bernstein type inequality holds (fast rates). Note that for the sake of shortness, we will 
refer to these settings as “Hoeffding assumption” and “Bernstein assumption” even if 
this terminology is non standard. We then apply these results in various settings: clas¬ 
sification (Section [5]), convex classification (Section [6]), ranking (Section [ 7 ]), and matrix 
completion (Section [8]). In each case, we show how to specialise the general results of 
Section [4] to the considered application, so as to obtain the properties of the VB approx- 


2 




















































imation, and we also discuss its numerical implementation. All the proofs are collected 
in the Appendix. 


2. PAC-Bayesian framework 

We observe a sample (Ad, Yf),..., (A n , Y n ), taking values in X x A, where the pairs 
(A i,Yi) have the same distribution P. We will assume explicitly that the (Aj,Y))’s are 
independent in several of our specialised results, but we do not make this assumption 
at this stage, as some of our general results, and more generally the PAC-Bayesian 
theory, may be extended to dependent observations; see e.g. Alquier and Li 12012 j. The 
label set y is always a subset of M. A set of predictors is chosen by the statistician: 
{fe : X —> M, 9 G 0}. For example, in linear regression, we may have: fe(x) = (9 , x), the 
inner product of X = lR rf , while in classification, one may have fo(x) = k<6»,a;)>o £ {0,1}. 

We assume we have at our disposal a risk function R{9)] typically R{9) is a measure 
of the prevision error. We set R = R(9), where 9 E argmineii; i.e. A is an optimal 
predictor. We also assume that the risk function R{9) has an empirical counterpart 
r n (9), and set r n = r n {9). Often, R and r n are based on a loss function i : M 2 -» M; i.e. 
R(9) = E[£(Y, fe(X))] and f n (6) = \ Y2=\ fe( x i))- (In this paper, the symbol E 
will always denote the expectation with respect to the (unknown) law P of the (A,-, 1))’s.) 
There are situations however (e.g. ranking), where R and r n have a different form. 

We define a prior probability measure ir(-) on the set 0 (equipped with the standard 
cr-algebra for the considered context), and we let .M+(0) denote the set of all probability 
measures on 0. 

Definition 2.1 We define, for any A > 0, the pseudo-posterior p\ by 


P\{d6) = 


exp[—Ar n {9)\ 

f exp[— Ar n ]d7r 


7r(d$). 


The pseudo-posterior p\ (also known as the Gibbs posterior, Catoni 12004, 2007 


or 

the exponentially weighted aggregate, Dalalyan and Tsybakov |2008| ) plays a central 
role in the PAC-Bayesian approach. It is obtained as the distribution that minimises 
the upper bound of a certain oracle inequality applied to random estimators. Practical 
estimators (predictors) may be derived from the pseudo-posterior, by e.g. taking the 
expectation, or sampling from it. Of course, when exp[— Xr n (9)] may be interpreted as 
the likelihood of a certain model, p\ becomes a Bayesian posterior distribution, but we 
will not restrict our attention to this particular case. 

The following ‘theoretical’ counterpart of p\ will prove useful to state results. 


Definition 2.2 We define, for any A > 0, n\ as 

exp[-Ai2(0)] 

T\( d 9) = j - r VPU ndfl • 

j exp[—Aitjd7r 
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We will derive PAC-Bayesian bounds on predictions obtained by variational approx¬ 
imations of p\ under two types of assumptions: a Hoeffding-type assumption, from 
which we may deduce slow rates of convergence (Subsection |4.1[ ), and a Bernstein-type 
assumption, from which we may obtain fast rates of convergence (Subsection |4.2[ ). 

Definition 2.3 We say that a Hoeffding assumption is satisfied for prior tt when there 
is a function f and an interval I C such that, for any A € I, for any 9 e 0, 


7r (Eexp{A[l?(0) - r„(0)]}) 
7T (E exp (A[r n (0) - #(0)]}) 


< exp [/(A, 77,)] . 


( 1 ) 


Inequality <[T|) can be interpreted as an integrated version (with respect to tt) of Ho¬ 
effding’s inequality, for which /(A,n) X A 2 /n. In many cases the loss will be bounded 
uniformly over 0; then Hoeffding’s inequality will directly imply 0- The expectation 
with respect to tt in ([!]) allows us to treat some cases where the loss is not upper bounded 
by specifying a prior with sufficiently light tails. 


Definition 2.4 We say that a Bernstein assumption is satisfied for prior tt when there 
is a function g and an interval I C Ml such that, for any A £ I, for any 9 G 0, 


tt (Eexp {A[i?(0) - R] - A [r n (9) - r^]}) 
tt (Eexp { A[r n (0) - r n \ - A [R(9) - R] }) 


< tt (exp [g(X, n) [R(9) - R}] ) . (2) 


This assumption is satisfied for example by sums of i.i.d 
variables, see Subsection 2.4 p. 27 in Boucheron et al. |2013 


sub-exponential random 
when a margin assumption 
on the function R(-) is satisfied [Tsybakov 2004 . This is discussed in Section 4.2 Again, 
extensions beyond the i.i.d. case are possible, see e.g. Wintenberger |2010 for a survey 
and new results. In all these examples, the important feature of the function g that we 
will use to derive rates of convergence is the fact that there is a constant c > 0 such that 
when A = cn, g{ A, n) = g(cn, n) x n. 

As mentioned previously, we will often consider r n (0) = ^ £(Yi, fg(Xi)), how¬ 

ever, the previous assumptions can also be satisfied when r n {9) is a U-statistic, using 
Hoeffding’s decomposition of U-statistics combined with the corresponding inequality 
for sums of independent variables Hoeffding, 1948 . This idea comes from Clemengon 


et al. [2008 and we will use it in our ranking application. 


Remark 2.1 We could consider more generally inequalities of the form 


tt (Eexp (A[itl(0) — R] — A[r n (0) — ry]}) 1 , r (x n) \ m) _R^]) 

tt (Eexp (A[r n (0) — r n ] — A[i2(0) — R]\) J “ ^ P 


that allow to use the more general form of the margin assumption of Mammen and Tsy 
bakov 1991JIJ , Tsybakov 2004b PAC-Bayes bounds in this context are provided by Catoni 
2007]. However, the techniques involved would require many pages to be described so we 


decided to focus on the cases k = 0 and k = 1 to keep the exposition simple. 
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3. Numerical approximations of the pseudo-posterior 

3.1. Monte Carlo 


As already explained in the introduction, the usual approach to approximate p\ is 

2014j proposed tem- 


MCMC (Markov chain Monte Carlo) sampling. 


Ridgway et al. 

2006[)" 


pering SMC (Sequential Monte Carlo, e.g. Del Moral et ah 2006]) as an alternative 
to MCMC to sample from Gibbs posteriors: one samples sequentially from p\ t , with 
0 = Ao < ■ ■ • < At = A where A is the desired temperature. One advantage of this 
approach is that it makes it possible to contemplate different values of A, and choose 
one by e.g. cross-validation. Another advantage is that such an algorithm requires little 
tuning; see Appendix B for more details on the implementation of tempering SMC. We 
will use tempering SMC as our gold standard in our numerical studies. 

SMC and related Monte Carlo algorithms tend to be too slow for practical use in 
situations where the sample size is large, the dimension of 0 is large, or fg is expen¬ 
sive to compute. This motivates the use of fast, deterministic approximations, such as 
Variational Bayes, which we describe in the next section. 


3.2. Variational Bayes 

Various versions of VB (Variational Bayes) have appeared in the literature, but the main 
idea is as follows. We define a family T C Afi_(@) of probability distributions that are 
considered as tractable. Then, we define the VB-approximation of py. p\. 


Definition 3.1 Let 

P\ = argmin/C(p, pa), 

peJ 7 

where /C(p, p\) denotes the KL (Kiillback-Leibler) divergence of p\ relative to p: IC(m, p) = 
/log[^]dm if m <C p (i.e. p dominates m), IC(m, p) = +oo otherwise. 

The difficulty is to find a family T (a) which is large enough, so that p\ may be close 
to pa, and (b) such that computing p\ is feasible. We now review two types of families 
popular in the VB literature. 

• Mean field VB: for a certain decomposition 0 = @i x ... x 0^, T is the set of 
product probability measures 


-A mf = P e M\(e) : P(d0) = np.(dft),vi e {l,..., d}, Pi e A^(0i) \ . (3) 


i= 1 


The infimum of the KL divergence JC(p,p\), relative to p = fX- Pi satisfies the 
following fixed point condition |Parisi, 1988| Bishop, 2006, Chap. 10]: 


Vj e {1, • ■ • ,d} Pj(d0j) oc exp ij {-A r n (9) + log7r(0)} p*(d0j) j 7r(d%). 


( 4 ) 
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This leads to a natural algorithm were we update successively every pj until sta¬ 
bilization. 


Parametric family: 

T p = {p E ,A4+(0) : p{d9) = f(9;m)d9,m E M} ; 

and M is finite-dimensional; say J~ p is the family of Gaussian distributions (of di¬ 
mension d). In this case, several methods may be used to compute the infimum. As 
above, one may used fixed-point iteration, provided an equation similar to 0 is 
available. Alternatively, one may directly maximize/log[exp[— Xr n (9)\^(9)\p(d9) 
with respect to paramater m, using numerical optimization routines. This ap¬ 
proach was used for instance in Hoffman et al. 120131 with combination of some 
stochastic gradient descent to perform inference on a latent Dirichlet allocation 

2014), 


model. See also e.g. Khan 


Gaussian variational approximation. 


Khan et al. [2013 for efficient algorithms for 


In what follows (Subsections 4.1 and 4.2) we provide tight bounds for the prevision 
risk of p\. This leads to the identification of a condition on T such that the risk of p\ is 
not worse than the risk of p\. We will make this condition explicit in various examples, 
using either mean field VB or parametric approximations. 


Remark 3.1 An useful identity, obtained by direct calculations, is: for any p -C ir, 

log J exp [—Ar n (6>)] 7r(d$) = -A J r n (9)p(d9) - K{p, 7r) + JC(p, p\). (5) 

Since the left hand side does not depend on p, one sees that p\, which minimises JC(p, p\) 
over T, is also the minimiser of: 

p\ = arg min j J r n (9)p(d9) + ^/C(p, 7r) 

This equation will appear frequently in the sequel in the form of an empirical upper bound. 


4. General results 

This section gives our general results, under either a Hoeffding Assumption (Definition 


approximation, and how it relates to risks bounds for Gibbs posteriors. These results 
will be specialised to several learning problems in the following sections. 


2.3) or a Bernstein Assumption (Definition |2.4[), on risks bounds for the variational 


4.1. Bounds under the Hoeffding assumption 

4.1.1. Empirical bounds 


Theorem 4.1 Under the Hoeffding assumption (Definition 2.3), for any e > 0, with 
probability at least 1 — e we have simultaneously for any p E A4+(0), 

[m P < /r„d P+ /(A ’"> +c( G )+log(l) . 
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This result is a simple variant of a result in |Cato ni ('2007 but for the sake of com¬ 
pleteness, its proof is given in Appendix [Aj It gives us an upper bound on the risk 
of both the pseudo-posterior (take p = px) and its variational approximation (take 
p = p\). These bounds may be be computed from the data, and therefore provide a sim¬ 
ple way to evaluate the performance of the corresponding procedure, in the spirit of the 
first PAC-Bayesian inequalities |Shawe-Taylor and Williamson, 1997, McAllester, 1998 


1999 . However, this bound do not provide the rate of convergence of these estimators. 


For this reason, we also provide oracle-type inequalities. 


4.1.2. Oracle-type inequalities 


Another way to use PAC-Bayesian bounds is to compare f Rdp\ to the best possible 
risk, thus linking this approach to oracle inequalities. This is the point of view developed 
Catoni |2004 2007] , Dalalyan and Tsybakov [2008 . 


m 


Theorem 4.2 Assume that the Hoeffding assumption is satisfied (Definition \2.fi\ ). For 
any e > 0, with probability at least 1 — e we have simultaneously 


Rdp x < BxiMliQ)) := inf { Rdp + 2 

p&M \(©) J 


/(A,rc) + £(/?, tt) + log (f) 


and 


[ Rdpx < B X (R) := inf 


Mp + 2 /(A -" ) + C( ^ )+l0g( » ) |. 


Moreover, 

Bx(R) = Bx(Ml(@)) + 1 inf JC(p,ttx) 

A peJ 7 2 

where we remind that nx is defined in Definition \2.2\ 


In this way, we are able to compare f Rdpx to the best possible aggregation procedure 
in _A/f^(@) and f Rdpx to the best aggregation procedure in T. More importantly, we are 
able to obtain explicit expressions for the right-hand side of these inequalities in various 
models, and thus to obtain rates of convergence. This will be done in the remaining 
sections. This leads to the second interest of this result: if there is a A = A(n) that leads 
to £>a(A4:^(0)) < R + s n with s n —> 0 for the pseudo-posterior px, then we only have 
to prove that there is a p & T such that IC(p, vta)/A < cs n for some constant c > 0 to 
ensure that the VB approximation px also reaches the rate s n . 

We will see in the following sections several examples where the approximation does 
not deteriorate the rate of convergence. But first let us show the equivalent oracle 
inequality under the Bernstein assumption. 
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4.2. Bounds under the Bernstein assumption 

In this context the empirical bound on the risk would depend on the minimal achievable 
risk r m and cannot be computed explicitly. We give the oracle inequality for both the 
Gibbs posterior and its VB approximation in the following theorem. 


Theorem 4.3 Assume that the Bernstein assumption is satisfied (Definition 2.4)■ As¬ 
sume that A > 0 satisfies A — g(X,n) > 0. Then for any e > 0, with probability at least 
1 — e we have simultaneously: 


J Rdp x -R<B x (A^(0)), 
J Rdp x -R<B x (R), 


where, for either A = .M+(©) or A = T, 


B X {A) 


1 


\ t\ \ inf 

A - g(X,n) peA 


[A + g(X,n)] 


J(R - R)dp + 2JC(p, 7r) + 2 log 



In addition, 


B x {R) = Bx (M\(Q)) + 


inf 1C 


X - g(X,n) p&t 


Pt 7T A+g(A,n) 
2 


The main difference with Theorem 


4.2 


is that the function R(-) is replaced by R(-) — R. 


This is well known way to obtain better rates of convergence. 


5. Application to classification 

5.1. Preliminaries 

In all this section, we assume that y = {0,1} and we consider linear classification: © = 
X = fg(x) = 1 (M > 0 . We put r n {0) = ± E?=i R(0) = F(Y ^ fg(X)) 

and assume that the [(Xj,li)]JL 1 are i.i.d. In this setting, it is well-known that the 
Hoeffding assumption always holds. We state as a reminder the following lemma. 


Lemma 5.1 Hoeffding assumption ([!]) is satisfied with f(X,n ) = A 2 /(2n). 


The proof is given in Appendix [A] for the sake of completeness. 

It is also possible to prove that Bernstein assumption ([2]) holds in the case where the 
so-called margin assumption of Mamrnen and Tsybakov is satisfied. This condition we 
use was introduced by Tsybakov 2004 in a classification setting, based on a related 


definition in Mamrnen and Tsybakov [1999 
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Lemma 5.2 Assume that Mammen and Tsybakov’s margin assumption is satisfied: i.e. 
there is a constant C such that 


~ l/g(x)^y) 2 ] < C[R(0) - R}. 

Then Bernstein assumption © is satisfied with g(\,n ) = 2 ^ A . 
Remark 5.1 We refer the reader to Tsybakov f2004 / for a proof that 

< \(e,x) |< t) < c't 


for some constant C' > 0 implies the margin assumption. In words, when X is not 
likely to be in the region (0,X} — 0, where points are hard to classify, then the problem 
becomes easier and the classification rate can be improved. 


We propose in this context a Gaussian prior: n = A/d(0, D 2 Id), and we consider a VB 
approach based on Gaussian families. The corresponding optimization problem is not 
convex, but remains feasible as we explain below. 


5.2. Three sets of Variational Gaussian approximations 

Consider the three following Gaussian families 

X\ = m € M d ,cr 2 € M^} , 

J -‘2 = j < h mcr 2 , m E M d ,er 2 E (M^_) 2 | (mean field approximation), 

J~s = |d> m s, m G M d , E G 5 d+ | (full covariance approximation), 

where $ m ff 2 is Gaussian distribution Nfim, cr 2 /^), $ m <T 2 is Nfim, diag(cr 2 )), and 
is Nd(m , S). Obviously, T\ C C C .M+(©), and 

Ba(X^(0)) < B x (X 3 ) < B x (R 2 ) < BxiTi). (6) 

Note that, for the sake of simplicity, we will use the following classical notations in the 
rest of the paper: ip(-) is the density of jV(0,1) w.r.t. the Lebesgue measure, and $(•) 
the corresponding c.d.f. The rest of Section [5] is organized as follows. In Subsection |5.3[ 
we calculate explicitly B\(J- 2 ) and Bx^i). Thanks to ([6]) this also gives an upper bound 
on Bx(J~ 3 ) and proves the validity of the three types of Gaussian approximations. Then, 
we give details on algorithms to compute the variational approximation based on J~ 2 and 
J~ 3 , and provide a numerical illustration on real data. 


5.3. Theoretical analysis 

We start with the empirical bound for T 2 (and T\ as a consequence), which is a direct 
corollary of Theorem |4.1| 
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Corollary 5.3 For any e > 0, with probability at least 1 — e we have, for any m E M. d , 
a 2 E (M+) d , 


Rd<S> 


in. cr - — 


< / r n d4> ^2 + --h 


A Ef= 1 5 lo g(f^)+^ + ~ 5 + 1 °g(e) 


2 n 


A 


We now want to apply Theorem 4.2 in this context. In order to do so, we introduce 
an additional assumption. 


Definition 5.1 We say that Assumption A1 is satisfied when there is a constant c > 0 
such that, for any (6, 6') E © 2 with ||0||= ||0 , ||= 1, P((X, 6) ( X , O') < 0) < c\\6 — O' |. 

Note that this is not a stringent assumption. For example, it is satisfied as soon as 
X/||X|| has a bounded density on the unit sphere. 


Corollary 5.4 Assume that the VB approximation is done on either T\, J -2 or J- 3 . 
Take A = y/nd and 0 = Under Assumption Al, for any e > 0, with probability at 
least 1 — e we have simultaneously 


f Bdpx 
f Rdpx 


< R + \ - log (4ne 2 ) + —— + 
V n \/n 


, r^~ 1 2i °g(?) 

V4n3 + Vnd ' 


See the appendix for a proof. Note also that the values A = y/nd and 1 ? = ^ allow to 
derive this almost optimal rate of convergence, but are not necessarily the best choices 
in practice. 


Remark 5.2 Note that Assumption Al is not necessary to obtain oracle inequalities on 
the risk integrated under p\. We refer the reader to Chapter 1 in Catoni [2001] for such 


assumption-free bounds. However, it is clear that without this assumption the shape of p\ 
and p\ might be very different. Thus, it seems reasonable to require that Al is satisfied 
for the approximation of p\ by p\ to make sense. 


We finally provide an application of Theorem 4.3 Under the additional constraint 


that the margin assumption is satisfied, we obtain a better rate. 


Corollary 5.5 Assume that the VB approximation is done on either T\, J -2 or T 3 . Un¬ 


der Assumption Al (Definition \5.1\ page 10), and under Mammen and Tsybakov margin 
assumption, with A = an d "d > 0, for any e > 0, with probability at least 1 — e, 


jRdpx ] / n, t (C + 2 )(C + 1 ) fdlogg , 2d0 , 2 d , 2 , 2 ) , Vd2c{2C + 1 ) 

jRdpx}~ + 2 { n + n 2 + 0 dn + n g e} + n 


The prior variance optimizing the bound is 0 = d/(d + 2 + 2d/n), this choice or any 
constant instead will lead to a rate in dlog(n)/n. Note that the rate d/n is minimax- 
optimal in this context. This is, for example, a consequence of more general results 
Lecue |2007 under a general form of the the margin assumption. See the Appendix 


m 


for a proof. 
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5.4. Implementation and numerical results 

For family J ~2 (mean field), the variational lower bound ([ 5 ]) equals 


A 


<x) = - Y] <f> -Yi 

n ■ 1 \ 




T i d 

m m 1 


i=1 \ \JXidiag(a 2 )X t i ^ 

while for family J -3 (full covariance), it equals 


2d 


+ 9 log ^ 


k =1 


°k 

d 


A 


£ Ai1? (m,S) = — -Yi 


n 


i =1 


Aim 


T 

m m 




+ \ hog|S|-^trE 


Both functions are non-convex, but the multimodality of the latter may be more 
severe due to the larger dimension of J- 3 . To address this issue, we recommend to use 

which makes the dimension 


the reparametrisation of Opper and Archambeau 


of the latter optimisation problem 0(n ) 


see 



for a related approach. In 


both cases, we found that deterministic annealing to be a good approach to optimise 
such non-convex functions. We refer to Appendix B for more details on deterministic 
annealing and on our particular implementation. 

We now compare the numerical performance of the mean field and full covariance VB 


approximations to the Gibbs posterior (as approximated by SMC, see Section 3.1) for the 
classification of standard datasets; see Table [I] We also include results for a kernel SVM 
(support vector machine); this comparison is not entirely fair, since SVM is a non-linear 
classifier, while all the other classifiers are linear. Still, except for the Glass dataset, the 
full covariance VB approximation performs as well or better than both SMC and SVM 
(while being much faster to compute, especially compared to SMC). 


Dataset Covariates Mean Field (JF 2 ) Full cov. (JA) SMC SVM 


Pima 

7 

31.0 

21.3 

22.3 

30.4 

Credit 

60 

32.0 

33.6 

32.0 

32.0 

DNA 

180 

23.6 

23.6 

23.6 

20.4 

SPECTF 

22 

08.0 

06.9 

08.5 

10.1 

Glass 

10 

34.6 

19.6 

23.3 

4.7 

Indian 

11 

48.0 

25.5 

26.2 

26.8 

Breast 

10 

35.1 

1.1 

1.1 

1.7 


Table 1: Comparison of misclassification rates (%). 

Misclassification rates for different datasets and for the proposed approximations of the 
Gibbs posterior. The last column is the missclassification rate given by a kernel-SVM with 
radial kernel. The hyper-parameters are chosen by cross-validation. 
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Interestingly, VB outperforms SMC in certain cases. This might be due to the fact 
that a VB approximation tends to be more concentrated around the mode than the 
Gibbs posterior it approximates. Mean field VB does not perform so well on certain 
datasets (e.g. Indian). This may due either to the approximation family being too 
small, or to the corresponding optmisation problem to be strongly multi-modal. 


6. Application to classification under convexified loss 

Compared to the previous section, the advantage of convex classification is that the 
corresponding variational approximation will amount to minimising a convex function. 
This means that (a) the minimisation problem will be easier to deal with; and (b) we 
will be able to compute a bound for the integrated risk after a given number of steps of 
the minimisation procedure. 

The setting is the same as in the previous section, except that for convenience we now 
take V = {—1,1}, and the risk is based on the hinge loss, 

1 n 

r n(°) = - max(0,1 -Yi <0,Xi>). 

i=l 

We will write R H for the theoretical counterpart and R H for its minimum in 6. We 
keep the superscript H in order to allow comparison with the risk R under the 0 — 1 loss. 
We assume in this section that the X % are uniformly bounded by a constant, |AQ|< c x . 
Note that we do not require an assumption of the form (Al) to obtain the results of this 
section, as we rely directly on the Lipschitz continuity of the hinge risk. 


6.1. Theoretical Results 

Contrarily to the previous section, the risk is not bounded in 8, and we must specify a 
prior distribution for the Hoeffding assumption to hold. 

Lemma 6.1 Under a independent Gaussian prior tt such that each component is N (0, i) 2 ) , 
and for A < 2 \J^ and with bounded design \ X t j | < c x , Hoeffding assumption 0 is sat¬ 
isfied with /(A, n) = A 2 /(4 n) - \ log (l - d % n Cx ) . 

The main impact of such a bound is that the prior variance cannot be taken too big 
relative to A. 


Corollary 6.2 Assume that the VB approximation is done on either J-±, J ~2 or J- 3 . 
Take A = A- \Jand 1 ? = For any e > 0, with probability at least 1 — e we have 
simultaneously 


fR H dp x 

fR H dp x 




n 


^ R + — \ — log — + 2 c x —|- 


1 fc 2 x + l 


n 


xfnd \ 2 c x 


+ 2 c x log - 
e 
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The oracle inequality in the above corollary enjoys the same rate of convergence as 
the equivalent result in the preceding section. In the following we link the two results. 


Remark 6.1 As stated in the beginning of the section we can use the estimator specified 
under the hinge loss to bound the excess risk of the 0-1 loss. We write R* and R H * the 
respective risk for their corresponding Bayes classifiers. From Zhang \200 f / (section 3.3) 
we have the following inequality, linking the excess risk under the hinge loss and the 0 — 1 
loss, 

R(0) — R* < R h ( 0) - R h * 


By integrating with respect to p H (the VB approximation on any 


for every 9 £ 1 

F\,F‘ 2 ; of the Gibbs posterior for the hinge risk) and making use of Corollary 6.2 
have with high probability, 


we 


i H ( R{6 )) - R* < inf R H (6) - R H * 
ogrp 



6.2. Numerical application 

We have motivated the introduction of the hinge loss as a convex upper bound. In the 
sequel we show that the resulting VB approximation also leads to a convex optimization 
problem. This has the advantage of opening a range of possible optimization algorithms 
(Nesterov, 20041. In addition we are able to bound the error of the approximated measure 
after a fixed number of iterations (see Theorem 6.3). 

Under the model T\ each individual risk is given by: 


Pm,a(ri(0)) = (1 - Tim ) $ 


1 — Um 


a T. 


i 2 


+ p\\?i\W 


1 — T,m 


a T 


i 2 


writting U; := YjVi- 

Hence the lower bound to be maximized is given by 


m 

a 


A 


C(m , a) =-< V (1 — T;m) <f> 

n z —■' 


. i =1 


1 — T,m 


dir 


i 2 


+ y d|ri||<p 


i =1 


1 — Tim 

"dEr 


i m2 

\m\\i 


i 2 


d 


+ -(loga 2 -^ 


26 ' 2 

It is easy to see that the function is convex in (m, a), hrst note that the map 


T : 


- + yip - 


is convex and note that we can write 


m 

a 


= A 


+ b I hence by com¬ 


position of convex function with linear mappings we have the result. Similar reasoning 
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could be held for the case J~i and J- 3, where in later the parametrization should be done 
in C such that £ = CC t . The bound is however not universally Lipschitz in a, this 
impacts the optimization algorithms. 

On the class of function J -0 = |<3? m 1 ,m G for which our Oracle inequalities still 
hold we could get faster numerical algorithms. The objective function has Lipschitz 
continuous derivatives and we would get a rate of ■ 

Other convex loss could be considered which could lead to convex optimization prob¬ 
lems. For instance one could consider the exponential loss. 

Dataset Covariates Hinge loss SMC 


Pima 

7 

21.8 

22.3 

Credit 

60 

27.2 

32.0 

DNA 

180 

4.2 

23.6 

SPECTF 

22 

19.2 

08.5 

Glass 

10 

26.12 

23.3 

Indian 

11 

26.2 

25.5 

Breast 

10 

0.5 

1.1 


Table 2: Comparison of misclassification rates (%). 

Misclassification rates for different datasets and for the proposed approximations of the Gibbs 
posterior. The hyperparameters are chosen by cross-validation. This is to be compared to 
Table [T] 


Theorem 6.3 Assume that the VB approximation is done onJ~i,J -2 or J-3. Denote by 
Pk(d9) the VB approximated measure after the kth iteration of an optimal convex solver 
using the hinge loss. Take A = \fnd and 9 = ^ then under the hypothesis of Corollary 


6.2 with probability 1 — e 


R h dpk < B.“ + 


H 


LM c x [d, n d 1 (cL + 1 2 

-7T=T + + i\ - log l +2c *- + ^=i \ 0 -+ 2c - lo g- 

V 1 + k 2 V n d n Vnd \ 2 c x e 


where L is the Lipschitz coefficient on a ball of radius M of the objective function max¬ 
imized in VB. 


From Theorem 6.3 we can compute the number of iterations to get 
error at a given probability. 

We find that on average the misclassification error (Table [2]) is lower 
loss where we have no guaranties that the maximum is attained. 


a given level of 
than for the 0-1 
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7. Application to ranking 

7.1. Preliminaries 

In this section we take T = {0,1} and consider again linear classifiers: © = X = M rf , 
fe(x) = l{e,x)>o- We consider however a different criterion: in ranking, not only we want 
to classify well an object x, but we want to make sure that given two different objects, 
the one that is more likely to correspond to a label 1 will be assigned a larger score 
through the function fg. A usual way to measure this is to introduce the risk function 


R(0) = P[(Yi - Y 2 )(f e (X i) - f e (X 2 )) < 0] 

and the empirical risk 

r n {8) = 


1 


n(n _ D Y 1 {(y i -Y j )mx i )-fe(Xj))< 0 }- 


Then, again, we recall classical results. 

Lemma 7.1 The Hoeffding-type assumption is satisfied with f(X,n ) = Wy. 


The variant of the margin assumption adapted to ranking was established by Robbiano 
[2013 and Ridgway et al. [2014 


Lemma 7.2 Assume the following margin assumption: 

E [(%HW)-/ 0 (W)][Yi-Y 2 ]<O - 1 [/e(Xi)-4(X 2 )][Fi-Y 2 ]<o) 2 ] < C[R(0) - R\. 
Then Bernstein assumption ([2]) is satisfied with g(X,n) = A . 

We still consider a Gaussian prior 


d 

7r(d#) = JJ( / j(6'j;0,-d 2 )d6» i 
i —1 

and the approximation families will be the same as in Section 0 F\ = {V 2 > m G 
R d ,a 2 £ M+}, T-i = {4> m o . 2 ,m £ R d ,a 2 £ (M+) 2 } and F:<, = {4 > mi s,m £ M d , S £ S d+ }. 


7.2. Theoretical study 

Here again, we start with the empirical bound. 

Corollary 7.3 For any e > 0, with probability at least 1 — e we have, for any m £ 
a 2 £ (M + ) d ; 


Rd<Z> m (T 2 < 


id4> 


r 2 + 


n — 1 


+ 


sr^d 

2-,j=i 


\ lQ g (§ 


+ ^ 


+ 


tf 2 


-i+iog(i) 
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In order to derive a theoretical bound, we introduce the following variant of Assump¬ 
tion Al. 

Definition 7.1 We say that Assumption A2 is satisfied when there is a constant c > 0 
such that, for any (0,9') G 0 2 with ||0||= ||0 , ||= 1, P((A4 — X 2 ,0) (Ad — X-2,0') < 0) < 
c\\0-0'\\. 

Assumption A2 is satisfied as soon as (X\ — X 2 )/\\X\ — X 2 W has a bounded density on 
the unit sphere. 


Corollary 7.4 Use either T\, Jb or J- 3 . Take A = \j d ^ n 2 an d d = 1. Under (A2), 
for any e > 0, with probability at least 1 — e, 


J Rdp\ 

f Rdp x 


<R+fifT(l + \\og(2d(n 


1 ))) + 


0/2 

y/n — 1 


. 2 y/ 21 og(f) 

y(n- l)d 


Finally, under an additional margin assumption, we have: 


Corollary 7.5 Under Assumption A2 and the margin assumption of Lemma (7.2), for 


A = and d > 0, for any e > 0, with probability at least 1 — £, 


JUdh 1 (C+5)(C+1) f rilogg 2jg_ 2 d 2 2] 

f Rdp\ j 2 \ n — 1 n(n — 1) 1 ? — 1 n — 1 tj 

y/d4c(C + 1) 


The prior variance optimizing the bound is'!9 = d/(d + 2 + 2d/n). The proof is similar 


to the ones of Corollaries 5.4, 5.5 and 7.4 


As in the case of classification, ranking under an AUC loss can be done by replacing 
the indicator function by the corresponding upper bound given by an hinge loss. In this 
case we can derive similar results as for the convexified classification in particular we 
can get a convex minimization problem and obtain result without requiring assumption 
(A2). 


7.3. Algorithms and numerical results 


As an illustration we focus here on family T 2 (mean field). In this case the VB objective 
to maximize is given by: 


£(m, (j 2 ) 


n + n_ 


E 


r-Vi=l,3-yj=0 




l m ll : 

”2tT 


d r 


hE 


k= 1 L 


log Ok 


d J ’ 

(7) 


where T^ = X, — Xj, and where ( 7 ^) k are the elements of F. 

This function is expensive to compute, as it involves n+?r_ terms, the computation of 
which is 0(p). 
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We propose to use a stochastic gradient descent in the spirit of Hoffman et al. [2013 
The model we consider is not in an exponential family, meaning we cannot use the trick 
developed by these authors. We propose instead to use a standard descent. 

The idea is to replace the gradient by a unbiased version based on a batch of size B 
as described in Algorithm [4] in the Appendix. Robbins and Monro 119511 show that for 
a step-size (A*)* such that Aj < oo and At = oo the algorithm converges to a local 
optimum. 

In our case we propose to sample pairs of data with replacement and use the unbiased 
version of the derivative of the risk component. We use a simple gradient descent with¬ 
out any curvature information. One could also use recent research on stochastic quasi 
Newton-Raphson [Byrd et al. 2014]. 

For illustration, we consider a small dataset (Pima), and a larger one (Adult). The 
latter is already quite challenging with n+n_ = 193, 829, 520 pairs to compare. In both 
cases with different size of batches convergence is obtained with a few iterations only 
and leads to acceptable bounds. 

In Figure [l] we show the empirical bound on the AUC risk as a function of the iteration 
of the algorithm, for several batch sizes. The bound is taken for 95% probability, the 
batch sizes are taken to be B = 1,10, 20, 50 for the Pima dataset, and 50 for the Adult 
dataset. The figure shows an additional feature of VB approximation in the context of 
Gibbs posterior: namely the possibility of computing the empirical upper bound given 
by Corollary 7.3. That is we can check the quality of the bound at each iteration of the 


algorithm, or for different values of the hyperparameters. 


8. Application to matrix completion 

The matrix completion problem has received increasing attention recently, partly due to 
spectacular theoretical result s {C andes and Tao 2010 , and to challenging applications 


like the Netflix challenge (Bennett and Lanning, 2007 . In the perspective of this paper, 


the specific interest of this application is twofold. First, this is a case where the family of 
approximations is not parametric, but rather of the form ([3]), i.e. the family of products 
of independent components. Then, there is no known theoretical result for the Gibbs 
estimator in the considered model, yet we can still directly bound the loss induced by 
the variational approximation. 

We observe i.i.d. pairs ((Xj,Y)))” =1 where Xj £ {l,...,mi} x {1, ...,7712}, and we 
assume that there is a m-i x m2-matrix M such that Y) = Mx t + e* and the e* are 
centred. Assuming that X{ is uniform on {1,..., mi} X {1,..., m2 }, that fg(Xi) = 0x t , 
and taking the quadratic risk, R{6) = IE [(F) — #X;) 2 ], we have that 


R{6) - R = — 

m\m2' 

where ||-||^ stands for the Frobenius norm. 

A common way to parametrise the problem is 


-\\e-M\\ 2 F 


® = {6 = UV T , U £ R mixK , V £ M m2Xi ^} 
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Iterations Iterations 

(a) Pima (b) adult 

Figure 1: Error bound at each iteration, stochastic descent, Pima and Adult 
datasets. 

Stochastic VB with fixed temperature A = 100 for Pima and A = 1000 for adult. The left panel shows 
several curves that correspond to different batch sizes; these curves are hard to distinguish. The right panel 
is for a batch size of 50. The adult dataset has n = 32556 observation and n+n_ = 193829520 possible 
pairs. The convergence is obtained in order of seconds. The bounds are the empirical bounds obtained in 
Corollary |7.3|for a probability of 95%. 


where K is large; e.g. K = min(mi, m2). Following Salakhutdinov and Mnih 2008| , we 
define the following prior distribution: U.j ~ A/"(0,7j/), V.j ~ jV(0,7 jl) where the 7/s 
are i.i.d. from an inverse gamma distribution, 7 j ~ ZT(a, b). 

Note that VB algorithms were used in this context by Lint and Teh |2007j (with a 
slightly simpler prior however: the y^’s are fixed rather than random). Since then, this 
prior and variants were used in several papers [e.g. 


et ah, 2010 


Lawrence and Urtasun 2009 Zhou 


Until now, no theoretical results were proved up to our knowledge. Two 
papers prove minimax-optimal rates for slightly modified estimators (by truncation), for 
which efficient algorithms are unknown [Mai and Alquier 2015, Suzuki, 2014|. However, 


using Theorems |4 . 2| and 4.3 we are able to prove the following: if there is a PAC-Bayesian 
bound leading to a rate for p\ in this context, then the same rate holds for p\. In other 
words: if someone proves the conjecture that the Gibbs estimator is minimax-optimal 
(up to log terms) in this context, then the VB approximation will enjoy automatically 
the same property. 
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We propose the following approximation: 


m i 


m2 


T = \ p(d(U, V)) = H Ui(dUi,) n Vj( dVj,) 

3 = 1 


i= 1 


Theorem 8.1 Assume that M = UV T with ||< C. Assume that rank(M) = r 

so that we can assume that U. iT+ \ = • • • = U. t K = Td r +i = • • • = V..k = 0 (note that 
the prior n does not depend on the knowledge of r though). Choose the prior distri¬ 
bution on the hyper-parameters jj as inverse gamma Inv—r(a, b) with b < l/[2/3(mi V 
m 2 ) log(2AT(mi V m2))]. Then there is a constant C(a, C) such that, for any (3 > 0, 


inf JC(p, ng) < C(a, C) 

peT 


|r(mi + m 2 ) log [/3b(mi + m 2 )K] + 



See the Appendix for a proof. 

For instance, in Theorem 4.3, in classification and ranking we had A, A — g(X,n) and 
A + g(X, n) of order 0(n). In this case we would have: 


A - g(X,n) pe.F 


inf/C ^p,n \ +g (\,n) ^ = O ^ 


C(a, C)r(mi + m2) log [n6(mi + m2) A"] A 


n 


I ? 


and note that in this context it is know that the minimax rate is at least r(mi +m2)/n 
[Koltchinskii et al., 2011 . 


8.1. Algorithm 

As already mentioned, the approximation family is not parametric in this case, but rather 
of type mean field. The corresponding VB algorithm amounts to iterating equation Q, 
which takes the following form in this particular case: 


K 


u, 


(d U jt .) (x exp -- V.U.J [(Y Xi ~ ( UV T ) x f ] - 

I , 


7i 


k =1 
K 


VM Vi,) oc exp --J2 E v. it u [(YXi ~ (UV T ) x f] - 
n , 




k= 1 


1 

27 k. 

1 

27 k 


u. 


jk 


V- 


jk 


p{Ik) oc exp | E u u kj + E v v & ] + (« + 1) log 


1_ A 

Ik Ik 


where the expectations are taken with respect to the thus defined variational approxi¬ 
mations. One recognises Gaussian distributions for the first two, and an inverse Gamma 
distribution for the third. We refer to Lim and Teh 2007 for more details on this 
algorithm and for a numerical illustration. 
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9. Discussion 


We showed in several important scenarios that approximating a Gibbs posterior through 
VB (Variational Bayes) techniques does not deteriorate the rate of convergence of the 
corresponding procedure. We also described practical algorithms for fast computation of 
these VB approximations, and provided empirical bounds that may be computed from 
the data to evaluate the performance of the so-obtained VB-approximated procedure. 
We believe these results provide a strong incentive to recommend VB as the default 
approach to approximate Gibbs posteriors, in lieu of Monte Carlo methods. 

We hope to extend our results to other applications beyond those discussed in this 
paper, such as regression. One technical difficulty with regression is that the risk function 
is not bounded, which makes our approach a bit less direct to apply. In many papers 
on PAC-Bayesian bounds for regression, the noise can be unbounded (usually, it is 
assumed to be sub-exponential), but one assumes that the predictors are bounded, see 
3.g. Alquier and Biau |2013|. However, using the robust loss function of Audibert and 


Catoni, it is possible to relax this assumption Audibert and Catoni, 2011, Catoni 2012 


This requires a more technical analysis, which we leave for further work. 
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A. Proofs 


A.l. Preliminary remarks 

We start by a general remark. Let h be a function 0 —> M_|_ with f exp[— h(9)\7r(d9) < oo. 
Let us put 


ir[h](d9) = 


exp[— h(9)\ 


J exp[— h(0')]ir(d6‘ 
Direct calculation yields, for any p <C n with f hdp < oo, 


A 7 ^)- 


/C(p,7r[/i]) = A 


hdp + /C(p, 7r) + log 


exp(— hfdir. 


Two well known consequences are 


7r[/i] = arg min < / hdp + K,(p,n) 
p&m\(b) {J 

— log / exp(— h)d-K = min < / hdp + IC(p, 7 

J peMUe) [J 


We will use these inequalities many times in the followings. The most frequent applica¬ 
tion will be with h(9) = A r n (9) (in this case ir[\r n ] = p\) or h{9) = ±X[r n (9) — R(9)], 
the first case leads to 


K.(p,px) = A J r n dp + /C(p, 7r) + log J exp(-Ar n )d7r, 


( 8 ) 


P\ = arg min A / r n dp + JC(p, it) \ , (9) 

{ J J 

-log / exp(-Ar n )d7r = min <A / r n dp + IC(p, tt) > . (10) 

J peM\(&) { J J 

We will use ([8]), ([9]) and (10) several times in this appendix. 

A.2. Proof of the theorems in Subsection 14.11 


Proof of Theorem f.l. This proof follows the standard PAC-Bayesian approach (see Catoni 


12007]). Apply Fubini’s theorem to the first inequality of ([Tj): 

E J exp {A[i?(0) - r n (9)] - /(A, n )} ^(d#) < 1 
then apply the preliminary remark with h(9) = \[r n {9) — R(9)j: 

E exp < sup f X[R(9) - r n {9)}p{d9) - /C(p, n) - /(A, n) 1 <1. 

I p&MUe)J J 
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Multiply both sides by s and use E[exp(f7)] > P (U > 0) for any U to obtain: 


sup f \[R(0) - r n {0)]p(dd) - K{p, i r) - /(A, n) + log(e) > 0 
M 1 (&) j 


Then consider the complementary event: 


< e. 


VpG A J Rdp < X J r n dp + /(A,n) + K.(p, n) + log 


> 1 - e. 


□ 

Proof of Theorem Using the same calculations as above, we have, with probability 
at least 1 — s, simultaneously for all p E 

A J Rdp < A J r n dp + /(A,n) +/C(p,7r) + log 0 ^ (11) 

A J r n dp < A J Rdp + /(A, n ) + JC(p, 7r) + log 0^ . (12) 

We use 0 with p = p\ and ®> to get 

A f Rdp\ < inf { A [ r n dp + /(A, n) + K.(p, it) + log ( - 
J p&MUe) { J 


and plugging (12) into the right-hand side, we obtain 


A f Rdp\ < inf <A f Rdp + 2/(A, n) + 2JC(p, tt) + 2 log ( - 
J peM]_(&) l J V 

Now, we work with p\ = argmin pe jr JC(p, p\). Plugging ([8]) into © we get, for any p. 
A J Rdp < /(A, n) + /C(p, p x ) - log J exp(-Ar n )d7r + log 0^ . 


By definition of p\, we have: 


A J Rdpx < m0/(A,n) +/C(p,p A ) - log J exp(-Ar n )dvr + log ^ 

and, using ([8]) again, we obtain: 

A J Rdpx < jrh | A J r n dp + /(A, n) + /C(p, 7r) + log 0 


We plug (12) into the right-hand side to obtain: 


A J Rdp\ < mf <0 J Rdp + 2/(A, n) + 2/C(p, 7r) + 2 log ^ - 
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This proves the second inequality of the theorem. In order to prove the claim 


B X (F) = \ inf K(p,ttx), 

a peJ 7 2 


note that 


B\{F) = inf { f Rdp 4- ^ 

p£J- I J A A A 


= inf < — — log J exp ( — — R ) d7r + 


2/(A, n) 2/C(p,7Ta) 2log (|) 


+ 


+ 


= — - log / exp [ — ) d7r + 


2/(A,n) 2 log (|) 2 . 


H-+ -r inf K.(p,n\) 

A A peT 2 


= £a(A 4^(0)) + - inf fC{p,TTx). 

A p&T 2 

This ends the proof. □ 

A.3. Proof of Theorem |4.3| (Subsection 4~2j) 


Proof of Theorem f.3. As in the proof of Theorem |4.1| we apply Fubini, then (10) to 


the first inequality of ([2]) to obtain 
Eexp |sup J [A[i?(0) - R\ - A [r n {9) - r n ] - g(X,n)[R(9) - i?]] p{d6) - /C(p,7r)| < 1 
and we multiply both sides by e/2 to get 

'2 , _ . N 

— 2 (13) 


{ sup 

1 

V 

1 

(p 

/ Rdp — R 

> A 

/ r n dp - r n 

l p 


J 


J 


+IC(p, 7r)+log ( - 


We now consider the second inequality in ([2]): 

Eexp{A[r n (0)-r„] - \[R(6)-R] - g(\,n)[R(0)-R}} < 1. 
The same derivation leads to 

+/C(p,7r)+log 


< sup 

1 

V 

1 

(p 

3^ 

/ r n dp—r n 

> A 

1 

Cl 

l p 


J 


U J 


< ~ 2 . (14) 


We combine (13) and (14) by a union bound argument, and we consider the complemen¬ 
tary event: with probability at least 1 — e, simultaneously for all p e A4+(0), 


[A - g(X,n)] 


Rdp — R 


< A 


r n dp - r r 


+ lC(p, vr) + log 


(15) 
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r n dp - r r 


< [A + s(A,n)] 


Rdp — R 


+ JC(p, vr) + log 


(16) 


We now derive consequences of these two inequalities (in other words, we focus on the 
event where these two inequalities are satisfied). Using ®> in (fX5| yields 


[A - g(X, n)] 


f Rdpx - R 

< inf {A 

/ r n dp - r n 

[ J J 

p£M\(e) y 

u \ 


+ /C(p,7r) + log( - 


We plug (|16|) into the right-hand side to obtain: 
Rdp\ — R 


[A - g(X,n)] 


< in/ < [A + g{ A, n)\ 
P £MUe) 


Rdp — R 


+ 2JC(p,tt) + 2 log ( - 


Now, we work with p\. Plugging (|8|) into (|13|) we get 

[A — g(A,n)] J Rdp — R 

By definition of p\, we have: 


< £(P,P\) ~ iog 


J exp[—A( 


— A(r n - r n )]dvr + log ( - 


[A - g(\,n)] 


Rdp\ — R 


< mf_ | IC(p, p\) - log J exp[—A(r n - r n )]d7r + log ( - 


Then, apply p|) again to get: 


[A - g(X,n)] 


Rdp\ — R 


< inf {A J(r n - r n )dp + K.(p, it) + log | . 


Plug (16) into the right-hand side to get 


[A - g(X,n)} 


J Rdp\ — R 


< inf 
pS-F 


[A + g((A,n)] 


J(R - R)dp + 2/C(p, 7r) + 2 log 



□ 


A.4. Proofs of Section [5] 


Proof of Lemma 5.1. Combine Theorem 2.1 p. 25 and Lemma 2.2 p. 27 in Boucheron 
et al. 120131. □ 
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Proof of Lemma 5.2 Apply Theorem 2.10 in Boucheron et al. 2013 , and plug the 
margin assumption. □ 

Proof of Corollary 5.f. We remind that thanks to ([6]) it is enough to prove the claim for 
T\. We apply Theorem |4.2| to get: 


b ^=a{J 


Rd$ 


m ,cr 


. A /C(<h m(T 2,7r)+ log (f) 

2 H-h 2--- 

n A 


= inf 

(m,cr 2 ) 


Rd<f> r 


T 2 + - + 2 
n 


d 


5 lo s(^) + l^ + ~ i + lo g (I) 


Note that the minimizer of R, 9, is not unique (because fe(x) does not depend on ||0||) 
and we can chose it in such a way that ||0||= 1. Then 


R(9) R — E l(6»,x)y<o l(e>,x)y<o — ® 4 < 0 ,x)( e,x)<o 


= P ((6,X)(6,X) < 0) < c 


-9 


< 2c\\9-9\\. 


So: 


B\{J r i)<R+ inf 

( m . cr 2 ) 


2c 


\\9-9\\<S> m ^{d9) 



+ 

X 



i+iog(f) 


We now restrict the inhmum to distributions v such that m = 9: 


B{F\) <R + inf 

cr 2 


2cVda H-1- 

n 



d + 2 log ( 



We put a = and substitute ^ for d to get 


0(71) <R+ X + c ^ + dl °g( 4 f ) + ^ + ^ + 21 °g(f) 
~ n A 

Substitute \fnd for A to get the desired result. □ 

Proof of Corollary |5.4 We apply Theorem 4.3 


(R - R)dp x 


< inf 

m , CT 2 


A + g(A, n) 

2 


(R — R) d$> ma 2 + 


T-TT-T ( 2 X$m l( T 2 , VT) + 2 log - 

A-s(A,n) V e 
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where A < Computations similar to those in the the proof of Corollary 


5.4 


lead to 


Rdp\ < R + inf < 2c 


A + ff(A,n) 
A - g(X,n) 


\\9 - ew^m 


+ 2 - 


ZU 5 ^ 


+ *#-f+log© 


A — g(X, n) 


taking m = 6 and A = we get the result. □ 

A.5. Proofs of Section |6| 


Proof of Lemma 6.1 


For fixed 0 we can upper bound the individual risk such that: 


0 < max(0,1— < 0, Xi > Yi) < 1 + |< 6 , X{ > 


such that we can apply Hoeffding’s inequality conditionally on Xj and fixed 0. 
We get, 


E [exp (A (R h - r%)) |Xi, < exp t ^ + l< 9 ' Xi > 

l i =1 
A 2 A 2 c 2 


< exp < --f- 


u x \\ q \\2 


4 n 4 n 


where the last inequality stems from the fact that (a + b ) 2 < 2 (a 2 + fr 2 ) and the fact 
that we have supposed the X t to be bounded. We can take the expectation of this term 
with respect to the Xfs and with respect to our Gaussian prior. 


7r {E [exp [X(R H — rff ))] } < 

< 


°*p(£) 

( 27 t ) 2\/?? 2 

ex p (£) 


(2ir)$V¥ 



1 X 2 c 2 

The integral is a properly defined Gaussian integral under the hypothesis that X—> 

0 hence A < The integral is proportional to a Gaussian and we can directly 

write: 

7r {E [exp (X(R h -rf))]} < 

writing everything in the exponential gives the desired result. □ 


ex P (£) 

fZZZZK 

V An 
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Proof of Corollary\6.P\ We apply Theorem 42 to get: 


- os. {/- H 1 - ^) + ^■r i08(f) 


= (W) ^ I ^ - T l0 § ( 1 - 


2?i A 


dA 2 e 2 

4n 


+ 2 - 


Tr~^d 

2^7 = 1 


§log(g)+£ 


2\ 


+ ^-I + log(f) 


A 


. 


We use the fact that the hinge loss is Lipschitz and that the (X t ) are uniformly bounded 
||X||oo< c x - We get R h (9) < R H +c x Vd\\9 — 9\\ and restrict the infemum to distributions 
v such that m = 9: 


W. . , I , 2 . A 1 


B(^) < R +inf { c x da + - - - log I 1 - ^ 


+ rfl°g(gi) + ^ + |i-rf + 21°g(j) 


A 


We specify a 2 = ^== and A = c x ^f^ such that we get: 




V n 

2c x y/n 


X 77P + ffz ~ d + 2 log 


n 


To get the correct rate we take the prior variance to be i3 2 = ^ by replacing in the above 
equation we get the desired result. 

□ 

Proof of Theorem 6.3. From Nesterov 120041 (th. 3.2.2) we have the following bound on 
the objective function minimized by VB, (the objective is not uniformity Lipschitz) 


P k ( r n) + \fc(p k ^) - ^inf | p(r%) + i/C(p,vr)| < -^==. 


(17) 


We have from equation (11) specified for measures p k probability 1 — e, 


A J r%dp k < A j R H dp k + /(A, n) + JC(p k , vr) + log ^ 
Combining the two equations yields, 

/ RHipk - vm + l P in - A) + { p(r " ) + \ K ^ *>} + x log \ 

We can therefore write for any p 6 J-\, 

j R H dp k < ~^== + \f{n, A) + p(r%) + jJC(p, vr) + j log ^ 
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Using equation © a second time we get with probability 1 — e 

J R H dp k < ^J== + jf(n, A) + p{R H ) + jlC(p, vr) + A log ^ 

Because this is true for any p E T\ in 1 — e we can write the bound for the smallest 
measure in T\. 


R H dp k < 


LM 


\/l + k A" 


p£-U 


+ v/( n > A) + inf { p(R H ) + K{p , tt) + y log 


A 


A 


By taking the Gaussian measure with variance A and mean 8 in the infemum and taking 
A = d-p/nd and d = we can use the results of Corrolary 6.2 to get the result.□ 

A.6. Proofs of Section [7j 


Proof of Lemma \7.1\ The idea of the proof is to use Hoeffding’s decomposition of U- 
statistics combined with Hoeffding’s inequality for iid random variables. This was done 
in ranking by Clemengon et al. 2008 , and later in Robbiano |2013 , Ridgway et al. |2014 


for ranking via aggregation and Bayesian statistics. The proof is as follows: we define 

<li,i = 1 (Yi-Y j )Ue(X i )-f g (X j ))<o - R(0) 

so that 

E ql j =r n (9)-R(e). 


U n := 


1 


n(n — 1) 


hj 


From Hoeffding 11948 we have 


If J 


U n = 


-Y — Y 


I n | z _^ ^(i),7r(i+ LfJ) 

„ L2J j=i 

where the sum is taken over all the permutations n of {1,..., n}. Jensen’s inequality 
leads to 


LfJ 


Eexp[AC/ n ] = Eexp 

T) I ^' 


A—V — V 

n \ I n | 


< 

n! 


exp 


I -1 

[ 2J i= 1 


^7r(i),-7r(i+LfJ) 


A 


LfJ 


I n I E ^7r(i),7r(i+LfJ) 

L 2 J i=1 


We now use, for each of the terms in the sum we use the same argument as in the proof 

A 2 


of Lemma 5.1 to get 


Eexp[AC/ n l < — exp 
n! ' 


2LIJ 


2 J J 


< exp 


A 2 

n — 1 
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(in the last step, we used [_§ J > (n — l)/2). We proceed in the same way to upper bound 
Eexp[— XU n ]. □ 

Proof of Lemma | 7. 4 As already done above, we use Bernstein inequality and Hoeffding 
decomposition. Fix 6. We define this time 

Q f lj = 1{<0, Xi - Xj) (Yi - Yj) < 0} - 1 {[a(Xi) - a(Xj) ](F, - Yj) < 0} - R{0) + R 

so that 


Then, 


U n := r n (0) -f n - R(0) +R = 


If J 


1 


z^ 


n(n-l) 


n ! Z I ni Z^Ub+LfJ)' 

7r ^ 2 J j=l 


Jensen’s inequality: 

Eexp[AC/ n ] = Eexp 

<Ae e 

n! 


LfJ 


^Z M Z 4 ),b + Lfj) 

7T L 2J j=l 


exp 


A 


LfJ 


I n | Z ^7r(i),7r(i+Lf J) 

L 2 J i=1 


Then, for each of the terms in the sum, use Bernstein’s inequality: 


Eexp 


A LfJ 

I n I Z ®7r(i),7r(i+LfJ) 

L 2-l j = i 


< exp 


E((g e 


7r(i)^(i+L|T) 2 ) LfJ 

2 f 1 - 2 t#t 


We use again LfJ > (n—1)/2. Then, as the pairs (Xi, If) are iid, we haveE((qr^ 
E((qf 2 ) 2 ) and then E((qf 2 ) 2 ) < C[R{6) — R] thanks to the margin assumption. So 


Eexp 


LfJ 


I«I 

[ 2J i= 1 


Z q %i 


), 7 T(i+Lf J) 


< exp 


C[R(6)-R}£- 1 


1 - 


4A 
n —1 


This ends the proof of the proposition. □ 

Proof of Corollary \7-4\ The calculations are similar to the ones in the proof of Corol¬ 
lary [5J4] so we don’t give the details. Note that when we reach 

Bx{ ^ <r + jy + YI± 1 21 °g (t ) , 

n — 1 A 

an approximate minimization with respect to A leads to the choice A = . □ 
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A.7. Proofs of Section |8| 

Proof. First, note that, for any p, 


K(p, irp) = f3 J (R - R)dp + K.(p, 7 r) + log J exp [—/3(R - -R)] d7r 
< (3 f (R - R)dp + /C(p, 7r). 


Now, we define a subset of T that will be used for the calculation of the bound. We 
define for 6 > 0 the probability distribution puy^(d6) as 7r conditioned to 9 = pv T with 
p is uniform on {V(i,£), | pi t g — U^g |< 5 } and v is uniform on {V(j, £), \v^ — Vj/\< 5 }. 
Note that 

J (R ~ R)dpM,N,s = J E((9 X - M x ) 2 )pu,v,s(d9) 

< J 3E(((UV t ) x ~ M x ) 2 )p uy 4d(p,v)) 

+ 3 J E(((Uv T ) x - (UV T ) x ) 2 ) PU y S (d(p,v)) 

+ 3 J E(((pv T ) x - (Uv T ) x ) 2 )puys{d{p,is)). 

By definition, the first term is = 0. Moreover: 


E(((Uu t ) x - (UV T ) x ) 2 ) PU y S (d(p,v)) 

l 2 


1 


< 


777 - 1777-2 


777 1 7772 


£ 


i,j L k 


^ ^ Uj,k faj,k V jik ) 


Puy,s(d(p,v)) 


E 


i,j L k 




'y ~ u. fe ) 


L k 


Pu,v,s(d(p, v)) 


< KrC 2 5 2 . 


In the same way, 


J H((pv T )x - {Uv T ) x ) 2 )puy 5 (d(p,v)) < J \\p - U\\ 2 F \\v\\ 2 F puys(d(p,v)) 

< Kr(C + 5) 2 S 2 . 


So: 


J(R - R)dp M ,N,S < 2 Krd 2 (C + 5 2 ). 


Now, let us consider the term IC(puys, 7r). An explicit calculation is possible but tedious. 
Instead, we might just introduce the set G$ = {6 = pu T , \\p — U\\f< 5, \\v — y||_p< <5} 
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and note that JC(pu,v,s, 7r ) < log ■ An upper bound for Gs is calculated page 317-320 
Alquier 120141 and the result is given by (10) in this reference: 


K(pu,v,6,k) < 45 2 + 2||£/|||,+2||JV||£+21og(2) 


+ (mi + m 2 )r log I - 


1 /37r(mi V m2)K 


+ 2 1\ log 


r(a)3 a+1 exp(2) 
b a+1 2 a 


as soon as the restriction b < 


s 2 


— 2m\K log(2miii') ’ 2m,2K log(2m2 K) 

fc(pu,v,8i 7r /?) ^ /32Kr5 2 (C + 5 2 ) + 4<5 2 + 2||f7||jr+2|| N ||^+2 log(2) 

/ . , (1 1 37r(mi V m 2 )/\\ 

+ (mi + m 2 )r log - \ --- + 2I\ log 


is satisfied. So we obtain: 


r(a)3 a+1 exp(2) 
b a+l 2 a 


Note that ||C||^< C 2 rmi, || V||^< C 2 rm 2 and K < rn\ + m 2 so it is clear that the choice 

5 = \fl and b< 2 ^ (mi vm a) iog(2Jf(m 1 vm 2 )) leads to the existence of a constant C(a,C) 
such that 

fc(Pu,v,5, up) < C(a, C ) |r(mi + m 2 ) log [/3b(mi + m 2 )K} + ^ 


□ 


B. Implementation details 

B.l. Sequential Monte Carlo 

Tempering SMC approximates iteratively a sequence of distribution p \ t , with 

P\ t ( dd ) = exp (~\ t r n (e)) ir(d9), 

and temperature ladder Ao = 0 < ... < At = A. The pseudo code below is given for an 
adaptive sequence of temperatures. 
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Algorithm 1 Tempering SMC 


Input N (number of particles), t e (0,1) (ESS threshold), k > 0 (random walk tuning 
parameter) 

Init. Sample 9 l 0 ~ it^(9) for i = 1 to N, set t <— 1, Ao = 0, Z$ = 1. 

Loop a. Solve in \ t the equation 

AT r /\ \ '\ /am 

^ jf— r — \\ 9 ~i = riV ’ w t( e > = ex P[-( A i - \-i)r n {9)\ (18) 

using bisection search. If A t > At, set Zt = Zt_i x j-^ and 

stop. 

b. Resample: for i = 1 to A, draw A\ in 1 so that P(A£ = j) = 

™t(0t-i)/EfcLi™i(^-i); see Algorithm [ 2 ] in the appendix. 

c. Sample 9\ ~ Mj(0( 4 ( 1 ,d6 ) ) for i = 1 to N where Mt is a MCMC kernel that 

leaves invariant 7 q; see comments below. 

d. Set^ = Zi_ 1 x{^E£it« t (flj_i)}. 


The algorithm outputs a weighted sample (u;^, 9 l T ) approximately distributed as target 
posterior, and an unbiased estimator of the normalizing constant Z\ T . 

Step 


b. of algorithm B.l depends of a resampling algorithm 
Systematic resampling, described in Algorithm [2j 


We choose to use 
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Algorithm 2 Systematic resampling 

Input: Normalised weights W( := wt(d{_i)/YliLi w t(Qt_i)- 
Output: indices A 1 € {1,.. . , IV}, for i = 1_, N. 

a. Sample U ~ U{[ 0,1]). 

b. Compute cumulative weights as C n = J2m=i NW m . 

c. Set s i — U : m i — 1. 

d. For n = 1 : N 

While C m < s do m <— m + 1. 

A n 4— m, and s <— s + 1. 

End For 


For the MCMC step, we used a Gaussian random-walk Metropolis kernel, with a 
covariance matrix for the random step that is proportional to the empirical covariance 
matrix of the current set of simulations. 


B.2. Optimizing the bound 


A natural idea to find a global optimum of the objective is to try to solve a sequence of 
local optimization problems with increasing temperatures. For 7 = 0 the problem can be 
solved exactly (as a KL divergence between two Gaussians). Then, for two consecutive 
temperatures, the corresponding solutions should be close enough. 

This idea has been coined under several names. It has a long history in variational 
algorithm under the name deterministic annealing, Yuille 120101 uses it on mean field on 
Gibbs distribution for Markov random fields. In addition the intermediate results can 
be of interest in our case for selecting the temperature. One can compute the bound at 
almost no additional cost as a function of the current risk. In turns this can be used to 
monitor the bound. 
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Algorithm 3 Deterministic annealing 
Input (Ai) te [ 0 j T] a sequence of temperature 

Init. Set m = 0 and £ = 'did-, the values minimizing KL-divergence for A = 0 
Loop t=l,... ,T 

a. m At , £ At = Minimize £ Ai (m, S) using some local optimization routine with ini¬ 

tial points m At_1 , £ At_1 

b. Break if the empirical bound increases. 

End Loop 



Figure 2: Deterministic annealing on a Pima Indians with one covariate and full 
model resp. 

The right panel gives the empirical bound obtained for the DA method (in red) and the dot are direct global 
optimization based on L-BFGS algorithms from starting values drawn from the prior. Each optimization 
problem is repeated 20 times. 


We find that using a deterministic annealing algorithm with a limited amount of steps 
helps in finding a high enough optimum. On the left panel of Figure[2j we can see the one 
dimensional case where the initial problem 7 = 0 corresponds to a convex minimization 
problem and where the increasing temperature gradually complexifies the optimization 
problem. Figure [2] shows that the solution given by DA is in average lower than randomly 
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initialized optimization. 


C. Stochastic gradient descent 

The stochastic gradient descent algorithm used in Section ?? is described as Algorithm 

II 

Algorithm 4 Stochastic Gradient Descent 

Input B a batch size, an unbiased estimator of the gradient Vb/, 77 6 (0,1) and c 

While -iconverged 

a. x t+ i = x t - Xt^Bf(x t ) 

b. Update A m = 

End Loop 

In all our experiment we take c = 1 and 7 / = 0.9. 
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